NRG-GY003 WES Analysis

Whole Exome genomic analysis of NRG-GY003 ovarian cancer tumors

Genomic analysis of tumor exomes from patients with recurrent ovarian cancer treated with immune checkpoint inhibitors (NIVO or IPI/NIVO) treated on the NRG-GY003 clinical trial.
Author
Affiliation

Kelsey Monson, PhD, MS

Icahn School of Medicine at Mount Sinai

Published

July 31, 2025

Keywords

Genomics, Whole Exome, Immunotherapy, R, Ovarian Cancer

Intro

Variant Filtering Strategy

I first used the Nextflow Sarek pipeline to align the raw fastq files. Below is the workflow I used:

Note that, at this time, I have only merged the variants from the SNP/indel callers, as ASCAT and the other SV/CNV callers do not have .vcfs as their output.

Here is the detailed variant filtering strategy I used to come up with the MAF file used in this analysis:

  • First find consensus somatic calls:

    • Present in >=2 somatic callers (Freebayes, Mutect2, Strelka2)
    • Include variants annotated with PASS
      • PASS: indicates the variant passed all quality filters applied by the variant caller

      • Considered the gold-standard for high-quality variants passing the variant caller’s built-in quality control filters

    • Further refine PASS variants
      • Minimum tumor Read Depth = 20: ensures sufficient coverage (i.e. evidence) to confidently call tumor variants from normal and/or distinguish from random sequencing errors

      • Allele frequency 0.05 < AF < 0.95:

        • >0.05: also helps avoid random sequencing errors/miss-called variants

        • <0.95: helps avoid germline contamination

  • Identify rare pathogenic germline variants:

    • Present in >=2 germline callers (Freebayes, Haplotypecaller, Strelka2)

    • Identify those most likely to be pathogenic

      • Limit to protein-coding variants (drop all intergenic and non-coding variants)

      • Must be annotated with HIGH impact

    • Eliminate common variants

      • “Rare” variants are typically those present in <1% of the population

      • Because I was worried about missing variants, I initially excluded those with >5% population frequency, but this was too liberal.

      • I revised the script to set the threshold to 1% and will re-generate the MAF file eventually

      • For the current analysis, I QC’d the variants and only identified likey oncogenic variants in BRCA1/2, so limited the germline analysis to those genes.

    • Include all likely oncogenes regardless of population frequency (“BRCA1” “BRCA2” “TP53” “PTEN” “ATM” “CHEK2” “PALB2” “APC” “MUTYH”)

  • Generated consensus MAF file

    • The above analysis generated two vcfs, one with the processed somatic variants and one with likely pathogenic germline variants

    • These variants were annotated as SOMATIC or GERMLINE_PATHOGENIC, respectively in the vcfs

    • I then concatenated these variants to one MAF file for the entire sample using the Nextflow vcf2maf pipeline.

This consensus MAF is what is used for the downstream analysis presented here.

Loading in the data

Packages Used
# Load packages

library(maftools) # For majority of maf file analysis
library(mclust)
library("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)
library(NMF) # For signature calculation
library(pheatmap) # For nice heatmaps
library(ghibli) # For pretty colors

Load in raw MAF file and the clinical data

laml = read.maf(maf="input/merged_consensus_all_samples_germline.maf", 
                clinicalData ="input/NRG-GY00_laml_annot_2.csv",
                rmFlags = TRUE # "FLAGS, frequently mutated genes in public exomes"
                )
-Reading
-Validating
-Removing 20 FLAG genes
-Silent variants: 30668 
-Summarizing
-Processing clinical data
-Finished in 3.320s elapsed (3.420s cpu) 
Tip

Setting rmFlags = TRUE removes frequently mutated genes in public exomes

Details on these genes can be found here.

Data cleaning: setting levels for factor variables
# RECIST response (CR, PR, SD, PD)
laml@clinical.data$response <- factor(laml@clinical.data$response, levels=c("CR","PR","SD","PD"))
#table(laml@clinical.data$response)

# Detailed response including durable and progressive SD
laml@clinical.data$response2 <- factor(laml@clinical.data$response2, levels=c("CR","PR","SD_CB","SD_NCB","PD"))
# table(laml@clinical.data$response2)

# Response by treatment
laml@clinical.data$response_tx <- factor(laml@clinical.data$response_tx, levels=c("Nivo_CB","Combo_CB","Nivo_NCB","Combo_NCB"))
#table(laml@clinical.data$response_tx)

# Race (White, Non-White, and Not Reported)
laml@clinical.data$race2 <- factor(laml@clinical.data$race2, levels=c("W","NW","NR"))
#table(laml@clinical.data$race2)

# Site (dichotomous)
laml@clinical.data$Site2 <- factor(laml@clinical.data$Site2, levels=c("Adnexa","Non_Adnexa"))
#table(laml@clinical.data$Site2)

# Primary/Met
laml@clinical.data$Primary_Met <- factor(laml@clinical.data$Primary_Met, levels=c("P","M"))
#table(laml@clinical.data$Primary_Met)

Subsetting datasets

We have some normal samples with no matched tumor, and we may be interested in doing subtype-specific analysis.

Let’s subset to only tumor samples, and then create further subsetted datasets based on clinical characteristics.

Only tumor samples

I also annotated with pathogenic germline variants; the only relevant ones were in BRCA1 and BRCA2, so we are subsetting to somatic mutations and pathogenic germline variants.

laml_tum = subsetMaf(maf = laml, clinQuery = "Tissue %in% 'Tu'")
-subsetting by clinical data..
--86 samples meet the clinical query
-Processing clinical data
laml_tum = subsetMaf(maf = laml_tum, query = "vcf_id %in% c('SOMATIC','GERMLINE_PATHOGENIC') ")
-Processing clinical data

Other subsets

We can also subset by tumor histology, treatment, and response.

## Comparing Serous vs others
laml_ser = subsetMaf(maf = laml_tum, clinQuery = "celltype %in% 'Serous_Adenocarcinoma'")
-subsetting by clinical data..
--71 samples meet the clinical query
-Processing clinical data
# Subsetting the other dataset to "not serous"
`%ni%` <- Negate(`%in%`)
laml_other = subsetMaf(maf = laml_tum, clinQuery = "celltype %ni% 'Serous_Adenocarcinoma'")
-subsetting by clinical data..
--15 samples meet the clinical query
-Processing clinical data
## Comparing CB vs NCB
laml_CB = subsetMaf(maf = laml_tum, clinQuery = "responseCB %in% 'CB'")
-subsetting by clinical data..
--29 samples meet the clinical query
-Processing clinical data
laml_NCB = subsetMaf(maf = laml_tum, clinQuery = "responseCB %in% 'NCB'")
-subsetting by clinical data..
--57 samples meet the clinical query
-Processing clinical data
## Subsetting by treatment to see if there are differences
### There shouldn't be any, since treatment assignment was randomized, but is is good to confirm.
laml_nivo = subsetMaf(maf = laml_tum, clinQuery = "tx %in% 'Nivo'")
-subsetting by clinical data..
--43 samples meet the clinical query
-Processing clinical data
laml_combo = subsetMaf(maf = laml_tum, clinQuery = "tx %in% 'Combo'")
-subsetting by clinical data..
--43 samples meet the clinical query
-Processing clinical data

View the MAF file summary

Here is a basic summary of the MAF file

laml_tum
An object of class  MAF 
                        ID summary    Mean Median
                    <char>  <char>   <num>  <num>
 1:             NCBI_Build  GRCh38      NA     NA
 2:                 Center       .      NA     NA
 3:                Samples      86      NA     NA
 4:                 nGenes    6897      NA     NA
 5:        Frame_Shift_Del     234   2.721    2.0
 6:        Frame_Shift_Ins      43   0.500    0.0
 7:           In_Frame_Del      83   0.965    0.0
 8:           In_Frame_Ins       6   0.070    0.0
 9:      Missense_Mutation    9193 106.895   81.0
10:      Nonsense_Mutation     524   6.093    4.0
11:       Nonstop_Mutation      13   0.151    0.0
12:            Splice_Site     505   5.872    3.0
13: Translation_Start_Site      19   0.221    0.0
14:                  total   10620 123.488   92.5
These are some more summaries you can generate
# I'm commenting them out as they have too much info for the tables to be readable.

# #Shows sample summary.
# getSampleSummary(laml_tum)

# #Shows gene summary.
# getGeneSummary(laml_tum)

# #shows clinical data associated with samples
# getClinicalData(laml_tum)

# #Shows all fields in MAF
# getFields(laml_tum)

Visual Summaries

Summarize the maf file

We can use plotmafSummary to plot the summary of the maf file, which displays number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification.

plotmafSummary(maf = laml_tum, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

#Use mafbarplot for a minimal barplot of mutated genes.
mafbarplot(maf =  laml_tum)

Summarize the other subsets

Here are the mutation distributions for the other subsets.

As a sanity check, the majority of Serous samples have TP53 mutations, while there are few TP53 mutations in the top 10 for the non-Serous samples (as is expected).

By Histology

Serous

## Serous 
plotmafSummary(maf = laml_ser, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_ser)

Non-Serous

## Non-serous
plotmafSummary(maf = laml_other, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_other)

By Response

CB

## CB
plotmafSummary(maf = laml_CB, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_CB)

NCB

## NCB
plotmafSummary(maf = laml_NCB, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_NCB)

By Treatment

Nivo

## Nivo
plotmafSummary(maf = laml_nivo, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_nivo)

Combo

## Combo
plotmafSummary(maf = laml_combo, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_combo)

Oncoprints

Oncoprints (or “oncoplots” as the wrapper function in maftools is called) summarize complex genomic features of a given dataset.

#oncoplot for top ten mutated genes.
oncoplot(maf = laml_tum, top = 10)

These are the key features and how they are interpreted:

  • Columns represent individual patients.
    • Reading column-wise, you can see the collection of alterations in a patient’s tumor for the given set of genes.
    • The bar plot on the top is a count of tumor mutation burden per patient, color-coded by mutation type.
  • Rows are altered genes.
    • Alterations are color-coded to indicate their type (e.g. mutation, deletion, insertion)
    • Genes can be further annotated with key features (e.g. high impact/likely pathogenic mutations, germline variants, etc.)
  • The bar plot on the right summarizes alterations in a given gene for the entire dataset.
  • Many more features and annotations can be added to further characterize the dataset.
# Highlight by specific attribute in MAF file
oncoplot(maf = laml_tum, 
         additionalFeature = c("IMPACT", "HIGH"))

# Add transitions/transversions  
oncoplot(maf = laml_tum, draw_titv = TRUE)

# cBioPortal alteration nomenclature 
oncoplot(maf = laml_tum, cBioPortal = TRUE)

Oncoprints by clinical data

## Primary vs metastatic sites
oncoplot(maf = laml_tum, clinicalFeatures = 'Primary_Met', sortByAnnotation = TRUE)

## Cell type
oncoplot(maf = laml_tum, clinicalFeatures = 'celltype', sortByAnnotation = TRUE)

## ICI response
oncoplot(maf = laml_tum, clinicalFeatures = c('responseCB','response2'), sortByAnnotation = TRUE)

## ICI response and treatment
oncoplot(maf = laml_tum, clinicalFeatures = 'response_tx', sortByAnnotation = TRUE)

## Mutational signature
oncoplot(maf = laml_tum, clinicalFeatures = 'Signature', sortByAnnotation = TRUE)

## Race
oncoplot(maf = laml_tum, clinicalFeatures = 'race', sortByAnnotation = TRUE)

## Site
oncoplot(maf = laml_tum, clinicalFeatures = c('Site2','Site'), sortByAnnotation = TRUE)

## Stratified by histology
## Serous
oncoplot(maf = laml_ser, clinicalFeatures = c('Site2','Site','celltype'), sortByAnnotation = TRUE)

## Non-Serous
oncoplot(maf = laml_other, clinicalFeatures = c('Site2','Site','celltype'), sortByAnnotation = TRUE)

Oncogenic signaling pathways

sigpw plots enrichment for known oncogenic signaling pathways as defined in TCGA, Sanchez/Vega et al.

# Plot genes in 2 top oncogenic signaling pathways
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 2)

# Collapse to just plot the pathways, now top 5
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE)

Pathways by Response

# Response
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'responseCB', sortByAnnotation = TRUE)

# Response by treatment
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'response_tx', sortByAnnotation = TRUE)

Pathways by Histology

Since these are the top pathways for the whole cohort, they may be masking histologically-specific pathways.

This is where it is useful to subset the data. Here I am plotting the top pathways for serous and all non-serous patients separately.

# Serous
oncoplot(maf = laml_ser, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

# Non-serous 
oncoplot(maf = laml_other, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

Biological processes of known tumor drivers

smgbp plots enrichment for pan-cancer significantly mutated genes, classified by biological processes, per Bailey et al.

Here are the top 2 biological processes of known drivers

# Note that I have highlighted the BRCA1/2 germline variants
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2, 
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

Pathways by Response & Histology

# Response
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'responseCB', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

# Detailed response (note that none of the germline variants had CR or PR)
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'response2', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

# Response by treatment 
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'response_tx', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

# Histology
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'celltype', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

Pathways by mutational signature*

oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'Signature', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

Note to Kelsey #1

*Will need to return to this, but it seems like sig1 is more BRCA1 while sig2 is more BRCA2.

I need to re-call the mutational signatures (downstream of these analyses in my original script) and update this section.

Top pathways

Now collapsing the plots to show only the pathways

# Top 5 pathways 
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE)

# Top 10 pathways*
# *I've set `topPathways` = 10 but this is 24 pathways, not sure what's going on there...
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE)

# By response
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'responseCB', sortByAnnotation = TRUE)

# By response and treatment
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'response_tx', sortByAnnotation = TRUE)

# By histology
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype', sortByAnnotation = TRUE)

Top pathways by histology

As with the oncogenic signaling pathways, I am plotting these separately for serous and non-serous.

# Serous
oncoplot(maf = laml_ser, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

# Non-serous
oncoplot(maf = laml_other, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10,
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

Additional oncoprints with more complex annotations

Code
## Color coding
ghibli_colors <- ghibli_palette("PonyoMedium", type = "discrete")
# ghibli_colors
respcolors <- c("#278B9AFF","#E75B64FF")
names(respcolors) = c("CB","NCB")
fabcolors = RColorBrewer::brewer.pal(n = 5,name = 'Spectral')
names(fabcolors) = c("CR", "PR", "SD_CB", "SD_NCB", "PD")
txcolors = c("#BEAED4","#7FC97F")
names(txcolors) = c("Nivo","Combo")
cellcolors = RColorBrewer::brewer.pal(n = 6,name = 'PRGn')
names(cellcolors) = c("Serous_Adenocarcinoma","Endometrioid_Adenocarcinoma","Adenocarcinoma",
                      "Clear_Cell","Mixed_Epithelial","Undifferentiated_Carcinoma")


anno_cols = list(tx = txcolors, response2 = fabcolors, Survtime = "Blues", celltype = cellcolors)
anno_cols2 = list(tx = txcolors, responseCB = respcolors, Survtime = "Blues", celltype = cellcolors)
#print(anno_cols2)
oncoplot(
  maf = laml_tum,
  clinicalFeatures = c('response2','tx',  'Survtime','celltype'),
  sortByAnnotation = TRUE,
  annotationColor = anno_cols
)

oncoplot(
  maf = laml_tum,
  clinicalFeatures = c('responseCB','tx','Survtime','celltype'),
  sortByAnnotation = TRUE,
  draw_titv = TRUE,
  annotationColor = anno_cols2,
  pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 2,
  additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC")
)

Other Visualizations

Transitions and transversions

Transitions are changes between two purines (A <-> G) or two pyrimidines (C <-> T), involving bases of simliar shapes.

Transversions are changes from a purine to a pyrimidine or vice versa, resulting in the exchange of one-ring and two-ring structures.

Here’s a nice picture from this page:

Transitions tend to be more common than transversions, despite there being twice as many possible transversions.

The titiv function summarizes these for the dataset:

laml.titv = titv(maf = laml_tum, plot = FALSE, useSyn = TRUE)
#plot titv summary
plotTiTv(res = laml.titv, plotNotch = TRUE)

Here are the plots separated by histology:

# Serous
laml.titv.ser = titv(maf = laml_ser, plot = FALSE, useSyn = TRUE)
plotTiTv(res = laml.titv.ser, plotNotch = TRUE)

# Non-Serous
laml.titv.other = titv(maf = laml_other, plot = FALSE, useSyn = TRUE)
plotTiTv(res = laml.titv.other)

Lollipop plots

Draws a lollipop plot of amino acid changes on a protein structure (protein domains are derived from the PFAM database).

Let’s look at the plot for TP53 (the most mutated gene in overall ovarian cohort)

# Lollipop plot for TP53
lollipopPlot(
  maf = laml_tum,
  gene = 'TP53',
  showMutationRate = TRUE
)
     HGNC    refseq.ID   protein.ID aa.length
   <char>       <char>       <char>     <num>
1:   TP53    NM_000546    NP_000537       393
2:   TP53 NM_001126112 NP_001119584       393
3:   TP53 NM_001126113 NP_001119585       346
4:   TP53 NM_001126114 NP_001119586       341
5:   TP53 NM_001126115 NP_001119587       261
6:   TP53 NM_001126116 NP_001119588       209
7:   TP53 NM_001126117 NP_001119589       214
8:   TP53 NM_001126118 NP_001119590       354

However, we know it’s the most commonly mutated gene because the majority of the samples are Serous.

This is what it looks like stratified by histology:

# Serous
lollipopPlot(
  maf = laml_ser,
  gene = 'TP53',
  showMutationRate = TRUE
)
     HGNC    refseq.ID   protein.ID aa.length
   <char>       <char>       <char>     <num>
1:   TP53    NM_000546    NP_000537       393
2:   TP53 NM_001126112 NP_001119584       393
3:   TP53 NM_001126113 NP_001119585       346
4:   TP53 NM_001126114 NP_001119586       341
5:   TP53 NM_001126115 NP_001119587       261
6:   TP53 NM_001126116 NP_001119588       209
7:   TP53 NM_001126117 NP_001119589       214
8:   TP53 NM_001126118 NP_001119590       354

# Non-Serous
lollipopPlot(
  maf = laml_other,
  gene = 'TP53',
  showMutationRate = TRUE
)
     HGNC    refseq.ID   protein.ID aa.length
   <char>       <char>       <char>     <num>
1:   TP53    NM_000546    NP_000537       393
2:   TP53 NM_001126112 NP_001119584       393
3:   TP53 NM_001126113 NP_001119585       346
4:   TP53 NM_001126114 NP_001119586       341
5:   TP53 NM_001126115 NP_001119587       261
6:   TP53 NM_001126116 NP_001119588       209
7:   TP53 NM_001126117 NP_001119589       214
8:   TP53 NM_001126118 NP_001119590       354

What about ARID1A? Here is the full cohort:

# Lollipop plot for ARID1A
lollipopPlot(
  maf = laml_tum,
  gene = 'ARID1A',
  showMutationRate = TRUE
)
     HGNC refseq.ID protein.ID aa.length
   <char>    <char>     <char>     <num>
1: ARID1A NM_006015  NP_006006      2285
2: ARID1A NM_139135  NP_624361      2068

And stratified by histology:

# Serous
lollipopPlot(
  maf = laml_ser,
  gene = 'ARID1A',
  showMutationRate = TRUE
)
     HGNC refseq.ID protein.ID aa.length
   <char>    <char>     <char>     <num>
1: ARID1A NM_006015  NP_006006      2285
2: ARID1A NM_139135  NP_624361      2068

# Non-Serous
lollipopPlot(
  maf = laml_other,
  gene = 'ARID1A',
  showMutationRate = TRUE
)
     HGNC refseq.ID protein.ID aa.length
   <char>    <char>     <char>     <num>
1: ARID1A NM_006015  NP_006006      2285
2: ARID1A NM_139135  NP_624361      2068

Note to Kelsey #2

I need to better understand what specifically is being plotted here; does this mean that all the mutations are found in introns and not actually protein-coding regions?

These regions don’t seem to map exactly to the exons when I look at ARID1A in UCSC Genome Browser.

Rainfall plots

Rainfall plots illustrate “hypermutated” genomic regions.

Setting detectChangePoints = TRUE detects genomic change points where potential kataegis are found.

Kataegis are genomic segments containing six or more consecutive mutations with an average inter-mutation distance of <=1,000 bp.

From the maftools help page for rainfallPlot():

Kategis detection algorithm by Moritz Goretzky at WWU Munster, which exploits the definition of Kategis (six consecutive mutations with an avg. distance of 1000 bp ) to identify hyper mutated genomic loci.

  • Algorithm starts with a double-ended queue to which six consecutive mutations are added and their average inter-mutation distance is calculated.
  • If the average inter-mutation distance is larger than 1000, one element is added at the back of the queue and one is removed from the front.
  • If the average inter-mutation distance is less or equal to 1000, further mutations are added until the average inter-mutation distance is larger than 1000.
  • After that, all mutations in the double-ended queue are written into output as one kataegis and the double-ended queue is reinitialized with six mutations.

By default, it plots the most mutated sample.

rainfallPlot(
  maf = laml_tum, 
  detectChangePoints = TRUE, 
  pointSize = 0.4,
  ref.build = "hg38"
)
Processing GADCHA..
No changepoints detected!

# Here's a sample with kataegis
rainfallPlot(
  maf = laml_tum, 
  detectChangePoints = TRUE, 
  pointSize = 0.4,
   tsb = "GADGAE",
  ref.build = "hg38"
 )
Processing GADGAE..
Kataegis detected at:
   Chromosome Start_Position End_Position nMuts Avg_intermutation_dist  Size
        <num>          <num>        <num> <int>                  <num> <num>
1:         11        1179999      1182583     6                  516.8  2584
   Tumor_Sample_Barcode   C>G   C>T   T>C
                 <fctr> <int> <int> <int>
1:               GADGAE     1     3     2

Compare TMB to TCGA

## Capture size for Twist Bioscience Human Comprehensive Exome kit (used for WES) is 36.8Mb
laml.mutload = tcgaCompare(maf = laml_tum, cohortName = 'NRG-GY003', logscale = TRUE, capture_size = 36.8)
Warning in FUN(X[[i]], ...): Removed 0 samples with zero mutations.
Capture size [TCGA]:  35.8
Capture size [Input]: 36.8
Performing pairwise t-test for differences in mutation burden (per MB)..

laml.mutload = tcgaCompare(maf = laml_tum, cohortName = 'NRG-GY003', logscale = TRUE)
Warning in FUN(X[[i]], ...): Removed 0 samples with zero mutations.
Performing pairwise t-test for differences in mutation burden..

# This seems...unlikely??

# Just in serous
laml.mutload = tcgaCompare(maf = laml_ser, cohortName = 'NRG-GY003', logscale = TRUE, capture_size = 36.8)
Warning in FUN(X[[i]], ...): Removed 0 samples with zero mutations.
Capture size [TCGA]:  35.8
Capture size [Input]: 36.8
Performing pairwise t-test for differences in mutation burden (per MB)..

# Other
laml.mutload = tcgaCompare(maf = laml_other, cohortName = 'NRG-GY003', logscale = TRUE, capture_size = 36.8)
Warning in FUN(X[[i]], ...): Removed 0 samples with zero mutations.
Capture size [TCGA]:  35.8
Capture size [Input]: 36.8
Performing pairwise t-test for differences in mutation burden (per MB)..